Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning
Abstract
We study the online estimation of the optimal policy of a Markov decision process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods that exploit the inherent minimax duality of the Bellman equations. The SPD methods update a few coordinates of the value and policy estimates each time a new state transition is observed. These methods use little storage and have low computational complexity per iteration. The SPD methods find an absolute-$\epsilon$-optimal policy, with high probability, using $O\left(\frac{|S|^4 |A|^2 \sigma^2}{(1-\gamma)^6 \epsilon^2}\right)$ iterations/samples for the infinite-horizon discounted-reward MDP and $O\left(\frac{|S|^4 |A|^2 H^6 \sigma^2}{\epsilon^2}\right)$ iterations/samples for the finite-horizon MDP.
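To make the idea of a coordinate-wise primal-dual update concrete, here is a minimal Python sketch. It is not the authors' algorithm: it assumes the standard Lagrangian form of the Bellman linear program with a uniform initial-state distribution, uniform sampling of state-action pairs, a constant step size, and simple clipping in place of any projection the paper may use. The function name `spd_sketch` and all of its parameters are hypothetical.

```python
import numpy as np


def spd_sketch(P, r, gamma, n_iters=5000, step=0.05, seed=0):
    """Illustrative stochastic primal-dual iteration for a discounted MDP.

    P: array (S, A, S) of transition probabilities, used only to sample
       next states (a stand-in for observing transitions online).
    r: array (S, A) of rewards, assumed to lie in [0, 1].
    Returns a value estimate v of shape (S,) and a policy pi of shape (S, A).
    """
    rng = np.random.default_rng(seed)
    S, A = r.shape
    v = np.zeros(S)                        # primal variable: value estimate
    lam = np.full((S, A), 1.0 / (S * A))   # dual variable: state-action weights
    v_max = 1.0 / (1.0 - gamma)            # upper bound on values for r in [0, 1]
    scale = S * A                          # importance weight for uniform sampling

    for _ in range(n_iters):
        # Observe one transition (s, a, s_next) and one initial state s0.
        s = rng.integers(S)
        a = rng.integers(A)
        s_next = rng.choice(S, p=P[s, a])
        s0 = rng.integers(S)               # assumption: uniform initial distribution

        # Sampled Bellman residual: the stochastic gradient with respect to lam[s, a].
        delta = r[s, a] + gamma * v[s_next] - v[s]

        # Primal descent step: only v[s], v[s_next] and v[s0] are touched.
        w = step * scale * lam[s, a]
        v[s] += w
        v[s_next] -= gamma * w
        v[s0] -= step * (1.0 - gamma)
        for idx in (s, s_next, s0):
            v[idx] = min(max(v[idx], 0.0), v_max)

        # Dual ascent step: only lam[s, a] is touched, then clipped to [0, 1].
        lam[s, a] = min(max(lam[s, a] + step * delta, 0.0), 1.0)

    # Read a randomized policy off the dual weights; uniform where a row is zero.
    row_sums = lam.sum(axis=1, keepdims=True)
    pi = np.where(row_sums > 0, lam / np.maximum(row_sums, 1e-12), 1.0 / A)
    return v, pi
```

Each iteration reads one transition and writes at most three entries of `v` and one entry of `lam`, which is the sense in which such a method uses little storage and little computation per step. The sketch makes no claim about convergence rate; the sample-complexity bounds quoted in the abstract apply to the paper's own methods.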
Related resources
Proximal Gradient Temporal Difference Learning Algorithms
In this paper, we describe proximal gradient temporal difference learning, which provides a principled way for designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not with respect to their original objective functions as previously attempted, but rather with respect to pri...
Primal-Dual π Learning: Sample Complexity and Sublinear Run Time for Ergodic Markov Decision Problems
Consider the problem of approximating the optimal policy of a Markov decision process (MDP) by sampling state transitions. In contrast to existing reinforcement learning methods that are based on successive approximations to the nonlinear Bellman equation, we propose a Primal-Dual π Learning method in light of the linear duality between the value and policy. The π learning method is model-free ...
Finite-Sample Analysis of Proximal Gradient TD Algorithms
In this paper, we show for the first time how gradient TD (GTD) reinforcement learning methods can be formally derived as true stochastic gradient algorithms, not with respect to their original objective functions as previously attempted, but rather using derived primal-dual saddle-point objective functions. We then conduct a saddle-point error analysis to obtain finite-sample bounds on their p...
Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning
Constrained Markov Decision Process (CMDP) is a natural framework for reinforcement learning tasks with safety constraints, where agents learn a policy that maximizes the long-term reward while satisfying the constraints on the long-term cost. A canonical approach for solving CMDPs is the primal-dual method which updates parameters in primal and dual spaces in turn. Existing methods for CMDPs o...
Stochastic Variance Reduction Methods for Policy Evaluation
Policy evaluation is a crucial step in many reinforcement-learning procedures, which estimates a value function that predicts states’ long-term value under a given policy. In this paper, we focus on policy evaluation with linear function approximation over a fixed dataset. We first transform the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then ...
Journal:
- CoRR
Volume abs/1612.02516
Publication date: 2016